Introduction
1. Data Description
1.1. Sample Selected
You can define bulleted or numbered lists:
- Pizza
- Pasta
- Cafe
- Espresso
- Macchiato
- Cappuccino
- Vodka
1.3. Import Data (CSV) [*]
data_SD3 <- read.delim("~/RProjects/2024-Q2-R-2 [MDA2024, exercises]/D1_SD3/data_SD3.csv", stringsAsFactors=TRUE)
2. Data Analysis
To add R code to the notebook, we use a code
chunk.
X <- iris
It is possible to have an overview of the data by using the
summary function.
summary(X)
Sepal.Length Sepal.Width Petal.Length
Min. :4.300 Min. :2.000 Min. :1.000
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600
Median :5.800 Median :3.000 Median :4.350
Mean :5.843 Mean :3.057 Mean :3.758
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100
Max. :7.900 Max. :4.400 Max. :6.900
Petal.Width Species
Min. :0.100 setosa :50
1st Qu.:0.300 versicolor:50
Median :1.300 virginica :50
Mean :1.199
3rd Qu.:1.800
Max. :2.500
In R there are three main types of data structure:
- Matrix. A mathematical object whose elements all share one type. In our example,
Y (created below) is a matrix.
- Data Frame. It is designed to organize and analyze
data, and it generalizes the matrix because its columns may have different types. In our example,
X is a data frame.
- List. An object that can contain data frames,
matrices, or other lists.
Y <- as.matrix(X[ ,1:4])
To handle data you can use the following code:
X[10, 2] # selection of one element in the Data Frame (or matrix)
[1] 3.1
X[5:20, 1:3] # selection of an interval
X[5:20, ] # an empty index selects all rows or all columns (here: all columns)
X$Sepal.Length # the symbol $ is used to select a column in the data frame
[1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8
[14] 4.3 5.8 5.7 5.4 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0
[27] 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0 5.5 4.9 4.4
[40] 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4
[53] 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6
[66] 6.7 5.6 5.8 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7
[79] 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5 5.5
[92] 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3
[105] 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5
[118] 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2
[131] 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8
[144] 6.8 6.7 6.7 6.3 6.5 6.2 5.9
2.1. Plots in R
# TRUE rather than T: T is an ordinary variable and can be overwritten
boxplot(X$Sepal.Length, main = "Box Plot of the Sepal Length of the IRIS Flowers", col = "blue", horizontal = TRUE)

The elements of the box-plot are reported below:
- Q1. It is the First Quartile. It leaves
25% of units on the left and 75% of units on the right. Left side of
the box
- Me. It is the Median=Q2. It leaves 50% of
units on left and right. Bold Line in the middle of the
box
- Q3. It is the Third Quartile. It leaves
75% of the units on the left and 25% of the units on the right.
Right side of the box
- In case of no outliers, the whiskers are defined
as:
- Xmin is the left whisker
- Xmax is the right whisker
- In case of outliers, the whiskers stop at the most extreme observations inside the following limits:
- Linf = Q1 - 1.5 (Q3-Q1): Lower Limit
- Lsup = Q3 + 1.5 (Q3-Q1): Upper Limit
# just a boxplot
boxplot(X[ ,1:4])

# boxplot with a title and colors
boxplot(X[ ,1:4], main = "Box Plot of all quantitative variables of IRIS data", col = terrain.colors(4))

boxplot(X$Sepal.Width ~ X$Species, main = "Box Plot of Sepal Width considering the 3 types of flowers", xlab = "Type of IRIS Flowers", ylab = "Sepal Width", col = terrain.colors(4))

The tilde symbol is obtained by:
- these do not work [*]:
- MAC: option + 5
- WIN: ALT + 125/6
- these work [*]:
- MAC: shift + the key to the left of 1
- WIN: shift + the key to the left of 1
[ 2.1. BONUS ]
The bold line is the Median, that is, the value of
the ordered distribution that leaves the same number of units above and
below (or on the left and right).
- Q1 is the first quartile. Q1 leaves 25% of units on the
left and 75% on the right.
- Q3 is the third quartile. Q3 leaves 75% of units on the
left and 25% on the right.
- Whiskers, without outliers, are the min and max of the
distribution
- Whiskers, with outliers, are limited by Linf =
Q1 - 1.5 (Q3-Q1) and Lsup = Q3 + 1.5 (Q3-Q1)
boxplot(X$Sepal.Length, main = "Box-Plot of the Sepal Length", col = "green", horizontal = FALSE)

boxplot(X[ ,1:4], main = "Box-Plot with all the Variables", col = "blue", horizontal = FALSE)

boxplot(X$Sepal.Width ~ X$Species, main = "Box-Plot of Sepal Width by type of IRIS Flower")

2.2. Bar Plot
A bar plot can be used for Qualitative Data and for
Categorized Quantitative Data. The first step in creating
a bar plot is to build a frequency table.
freq <- table(X$Species)  # named freq, not T, so the TRUE shorthand T is not masked
freq
setosa versicolor virginica
50 50 50
barplot(freq, main = "Bar Plot of Type of flowers", xlab = "Type of flowers", ylab = "Absolute Frequency", col = terrain.colors(4))

2.3. Pie Chart
It is based on the frequency table.
pie(freq, main = "Pie Chart", col = terrain.colors(4))

2.4. Histogram Chart
A histogram is a plot used only for Quantitative Data;
it is based on a frequency table grouped into classes. The R function is
called hist, and its input is the vector of values of a
quantitative variable.
hist(X$Sepal.Length)

hist(X$Sepal.Width, main = "Histogram of Sepal Width", xlab = "Classes", ylab = "Absolute Frequency", col = "lightgreen", border = "blue")

[ 2.4. BONUS ]
The histogram can be used only for quantitative
variables.
hist(X$Sepal.Width, main = "Histogram of the Sepal Width", xlab = "Classes",
ylab = "Absolute Frequency", col = "green", border = "red", breaks = 10)

With equally sized classes, the Y axis can report the
absolute frequency or the relative frequency. With
classes of different sizes, the Y axis has to report the
frequency density: \(d_i = n_i/h_i\), where \(n_i\) is the absolute frequency and \(h_i\) is the width of class \(i\).
3. Correlation Analysis
3.1. Correlation Plot
plot(X$Sepal.Length, X$Sepal.Width, main = "Correlation Plot", xlab = "Sepal Length", ylab = "Sepal Width")

[ 3.1. BONUS ]
plot(X$Petal.Length, X$Petal.Width, main = "Correlation Plot",
xlab = "Petal Length", ylab = "Petal Width", col = "blue",
pch = 1)

Plots with IRIS
plot(X$Sepal.Length, X$Sepal.Width, main = "(1-2) Plot with IRIS",
xlab = "Sepal Length", ylab = "Sepal Width", col = "blue")

plot(X$Petal.Length, X$Petal.Width, main = "(2-2) Plot with IRIS",
xlab = "Petal Length", ylab = "Petal Width", col = "red")

3.2. Pair Plot
pairs(X[ ,1:4])

The plots below the main diagonal mirror the plots above it, with the
two axes swapped. The reason is that both the scatter plot and the
correlation index are symmetric in the two variables.
r <- cor(X[ ,1:2])
r <- round(r, 3)
r
Sepal.Length Sepal.Width
Sepal.Length 1.000 -0.118
Sepal.Width -0.118 1.000
cor(X$Sepal.Length, X$Sepal.Width)
[1] -0.1175698
The range of the correlation index is: -1 <= r <= 1
The interpretation of the Correlation Index, called
r, is the following:
- |r| = 0 No Correlation
- 0.00 < |r| <= 0.25 Low Correlation
- 0.25 < |r| <= 0.50 Medium-Low Correlation
- 0.50 < |r| <= 0.75 Medium-High Correlation
- 0.75 < |r| < 1.00 High Correlation
- |r| = 1 Perfect Correlation
The correlation between Sepal Length and Sepal
Width is -0.118 and it is a low negative correlation.
4. Plots with GGPLOT Package
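# install ggplot2 only when it is missing; reinstalling an already loaded package triggers "Updating loaded packages"
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")
library(ggplot2)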
A ggplot2 plot is built from 3 main components:
- The first is the data: ggplot(data = X). On its own it
creates an empty frame.
- The second is the geometry (type of plot): geom_.
It adds a layer with the type of plot we want to show.
- The third is the aesthetic mapping, which selects the variables and their
visual properties: mapping = aes()
# GGPLOT (example no 1)
ggplot(data = X[ ,1:2])

# GGPLOT (example no 2.1): works, but using $ inside aes() bypasses the data argument and is discouraged
ggplot(data = X[ ,1:2]) +
geom_point(mapping = aes(X$Sepal.Length, X$Sepal.Width))

# GGPLOT (example no 2.2)
ggplot(data = X[ ,1:2]) +
geom_point(mapping = aes(Sepal.Length, Sepal.Width))

# GGPLOT (example no 2.3)
ggplot(data = X) +
geom_point(mapping = aes(Sepal.Length, Sepal.Width))

# GGPLOT (example no 3)
ggplot(data = X) +
geom_point(mapping = aes(Sepal.Length, Sepal.Width)) +
ggtitle("Scatter Plot") + xlab("Sepal Length") + ylab("Sepal Width")

4.1. Correlation Plot (Scatter Plot) with colors
ggplot(data = X) +
geom_point(mapping = aes(Sepal.Length, Sepal.Width, color = Species))

[ 4.1. BONUS ]
STANDARD TEMPLATE IS: ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
# First Example
ggplot(data = X) +
geom_point(mapping = aes(Petal.Length, Petal.Width, color = Species)) +
ggtitle("Petal Length and Width") +
xlab("Petal Length") + ylab("Petal Width")

# Second Example
ggplot(data = X, mapping = aes(Petal.Width, Petal.Length)) +
geom_point(mapping = aes(color = Species)) +
ggtitle("Petal Width and Length") +
xlab("Petal Width") + ylab("Petal Length")

4.2. Box Plot
ggplot(data = X) +
geom_boxplot(mapping = aes(Sepal.Width), color = "blue", outlier.colour = "red", outlier.shape = 8, outlier.size = 3) +
ggtitle("Box Plot for Sepal Width")

Box Plot taking into account the 3 types of flowers
ggplot(data = X) +
geom_boxplot(mapping = aes(Species, Sepal.Width), outlier.color = "red", outlier.shape = 8)

A ggplot can be stored in an object and extended later
p <- ggplot(data = X) +
geom_boxplot(mapping = aes(Species, Sepal.Width, fill = Species))
p

p + theme(legend.position = "bottom")

4.3. Bar Plot
ggplot(data = X) +
geom_bar(mapping = aes(Species)) +
ggtitle("Bar Plot with GGPLOT") +
ylab("Absolute Frequency")

5. Use the mpg data (fuel economy data from 1999 to 2008 for 38
popular models of cars)
A data frame with 234 rows and 11 variables:
- manufacturer brand name
- model model name
- displ engine displacement, in litres (the size of the
engine, which is related to its power but is not a power measure)
- year year of manufacture
- cyl number of cylinders
- trans type of transmission
- drv the type of drive train, where f = front-wheel drive, r
= rear wheel drive, 4 = 4wd
- cty city miles per gallon (the analogue of km per liter, in town)
- hwy highway miles per gallon (the analogue of km per liter, on the highway)
- fl fuel type
- class “type” of car
Y <- mpg
summary(Y)
manufacturer model displ
Length:234 Length:234 Min. :1.600
Class :character Class :character 1st Qu.:2.400
Mode :character Mode :character Median :3.300
Mean :3.472
3rd Qu.:4.600
Max. :7.000
year cyl trans
Min. :1999 Min. :4.000 Length:234
1st Qu.:1999 1st Qu.:4.000 Class :character
Median :2004 Median :6.000 Mode :character
Mean :2004 Mean :5.889
3rd Qu.:2008 3rd Qu.:8.000
Max. :2008 Max. :8.000
drv cty hwy
Length:234 Min. : 9.00 Min. :12.00
Class :character 1st Qu.:14.00 1st Qu.:18.00
Mode :character Median :17.00 Median :24.00
Mean :16.86 Mean :23.44
3rd Qu.:19.00 3rd Qu.:27.00
Max. :35.00 Max. :44.00
fl class
Length:234 Length:234
Class :character Class :character
Mode :character Mode :character
head(Y)
5.1. Bar Chart
table(Y$cyl)
4 5 6 8
81 4 79 70
# first example: factor(cyl) as a color ( vertical bar chart )
ggplot(data = Y) +
geom_bar(mapping = aes(cyl, fill = factor(cyl)))

# second example: class as a color ( vertical bar chart )
ggplot(data = Y) +
geom_bar(mapping = aes(cyl, fill = class))

# third example: class instead of cyl on the x axis ( vertical bar chart )
ggplot(data = Y) +
geom_bar(mapping = aes(class, fill = class))

# fourth example: class instead of cyl ( horizontal bar chart )
ggplot(data = Y) +
geom_bar(mapping = aes(class, fill = class)) + coord_flip()

# fifth example: class instead of cyl ( horizontal bar chart & legend at the bottom )
ggplot(data = Y) +
geom_bar(mapping = aes(class, fill = class)) + coord_flip() + theme(legend.position = "bottom")

5.2. Histogram
# example no 1.1
ggplot(data = Y) +
geom_histogram(mapping = aes(cty))
`stat_bin()` using `bins = 30`. Pick better value with
`binwidth`.

# example no 1.2
ggplot(data = Y) +
geom_histogram(mapping = aes(cty, colour = class))
`stat_bin()` using `bins = 30`. Pick better value with
`binwidth`.

# example no 1.3
ggplot(data = Y) +
geom_histogram(mapping = aes(cty, fill = class))
`stat_bin()` using `bins = 30`. Pick better value with
`binwidth`.

# example no 1.4: a constant colour goes outside aes(); inside aes() it would be treated as a data value
ggplot(data = Y) +
geom_histogram(mapping = aes(cty), fill = "red")
`stat_bin()` using `bins = 30`. Pick better value with
`binwidth`.

# example no 1.5
ggplot(data = Y) +
geom_histogram(mapping = aes(cty, fill = factor(cty)))
`stat_bin()` using `bins = 30`. Pick better value with
`binwidth`.

# example 2
ggplot(data = Y) +
geom_histogram(mapping = aes(hwy))
`stat_bin()` using `bins = 30`. Pick better value with
`binwidth`.

5.3. Facet Wrap with 1 grouping variable [*]
With the facet_wrap layer we can create sub-plots based on a
categorical variable.
ggplot(data = Y) +
geom_point(mapping = aes(displ, hwy)) +
facet_wrap(~drv)

5.4. Facet Wrap with 2 grouping variables [*]
ggplot(data = Y) +
geom_point(mapping = aes(displ, hwy)) +
facet_wrap(drv ~ class)

5.5. Smoothing Plot [*]
ggplot(data = Y) +
geom_smooth(mapping = aes(displ, hwy))
`geom_smooth()` using method = 'loess' and formula = 'y
~ x'

5.5.1. Smoothing Plot with “different type of line” [*]
# First Example
ggplot(data = Y) +
geom_smooth(mapping = aes(displ, hwy, linetype = class))
`geom_smooth()` using method = 'loess' and formula = 'y
~ x'
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
span too small. fewer data values than degrees of freedom.
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
pseudoinverse used at 5.6935
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
neighborhood radius 0.5065
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
reciprocal condition number 0
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
There are other near singularities as well. 0.65044
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x else if (is.data.frame(newdata)) as.matrix(model.frame(delete.response(terms(object)), :
span too small. fewer data values than degrees of freedom.
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x else if (is.data.frame(newdata)) as.matrix(model.frame(delete.response(terms(object)), :
pseudoinverse used at 5.6935
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x else if (is.data.frame(newdata)) as.matrix(model.frame(delete.response(terms(object)), :
neighborhood radius 0.5065
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x else if (is.data.frame(newdata)) as.matrix(model.frame(delete.response(terms(object)), :
reciprocal condition number 0
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x else if (is.data.frame(newdata)) as.matrix(model.frame(delete.response(terms(object)), :
There are other near singularities as well. 0.65044
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
pseudoinverse used at 4.008
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
neighborhood radius 0.708
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
reciprocal condition number 1.6135e-17
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
There are other near singularities as well. 0.25
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x else if (is.data.frame(newdata)) as.matrix(model.frame(delete.response(terms(object)), :
pseudoinverse used at 4.008
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x else if (is.data.frame(newdata)) as.matrix(model.frame(delete.response(terms(object)), :
neighborhood radius 0.708
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x else if (is.data.frame(newdata)) as.matrix(model.frame(delete.response(terms(object)), :
reciprocal condition number 1.6135e-17
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x else if (is.data.frame(newdata)) as.matrix(model.frame(delete.response(terms(object)), :
There are other near singularities as well. 0.25

# Second Example
ggplot(data = Y) +
geom_smooth(mapping = aes(displ, hwy, linetype = drv))
`geom_smooth()` using method = 'loess' and formula = 'y
~ x'

5.5.2. Smoothing Plot with “Facet Wrap” [*]
ggplot(data = Y) +
geom_smooth(mapping = aes(displ, hwy)) +
facet_wrap(~ drv)
`geom_smooth()` using method = 'loess' and formula = 'y
~ x'

5.5.3. Smoothing Plot with “color” [*]
ggplot(data = Y) +
geom_smooth(mapping = aes(displ, hwy, color = drv))
`geom_smooth()` using method = 'loess' and formula = 'y
~ x'

5.5.4. Smoothing Plot with “group” [*]
ggplot(data = Y) +
geom_smooth(mapping = aes(displ, hwy, group = drv))
`geom_smooth()` using method = 'loess' and formula = 'y
~ x'

5.5.5. Smoothing Plot combining different layers [*]
ggplot(data = Y) +
geom_point(mapping = aes(displ, hwy, color = drv)) +
geom_smooth(mapping = aes(displ, hwy, color = drv))
`geom_smooth()` using method = 'loess' and formula = 'y
~ x'

6. Regression Model [*]
In a regression model we need to define the dependent and
independent variables. In our case the model is defined as
follows:
- Y(Dependent/Outcome Variable) = hwy. The
variable gives the fuel economy on the highway, in miles per gallon
(the analogue of km per liter).
- X(Independent/Input Variable) = displ. The
variable gives the engine displacement in litres (engine size, which is related to horsepower but is not the same thing).
First, we create a scatter plot (correlation
plot).
6.1. Regression Model: Plot [*]
ggplot(data = Y) +
geom_point(mapping = aes(displ, hwy)) +
geom_smooth(method = lm, mapping = aes(displ, hwy))
`geom_smooth()` using formula = 'y ~ x'

6.2. Regression Model: A Different Fit for each group [*]
ggplot(data = Y) +
geom_point(mapping = aes(displ, hwy, color = drv)) +
geom_smooth(method = lm, mapping = aes(displ, hwy, color = drv))
`geom_smooth()` using formula = 'y ~ x'

6.3. Regression Model: Parameters Estimation [*]
res.reg <- lm(hwy ~ displ, data = Y)
summary(res.reg)
Call:
lm(formula = hwy ~ displ, data = Y)
Residuals:
Min 1Q Median 3Q Max
-7.1039 -2.1646 -0.2242 2.0589 15.0105
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 35.6977 0.7204 49.55 <2e-16 ***
displ -3.5306 0.1945 -18.15 <2e-16 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.836 on 232 degrees of freedom
Multiple R-squared: 0.5868, Adjusted R-squared: 0.585
F-statistic: 329.5 on 1 and 232 DF, p-value: < 2.2e-16
---
title: "My First Notebook MDA2024"
output: html_notebook
---

# Introduction


# 1. Data Description

## 1.1. Sample Selected

You can define bulleted or numbered lists:

- Pizza
- Pasta
- Cafe
  - Espresso
  - Macchiato
  - Cappuccino
- Vodka
  - Bisont Grass
  - Soplica

## 1.2. Formula

Here you can define a formula:

$$ y = \beta_0 + \beta_1X + \epsilon $$

A formula can also be written inline in the text: $\mu = \frac{1}{n} \sum_{i=1}^{n} X_i$

## 1.3. Import Data (CSV) [*]

```{r}
data_SD3 <- read.delim("~/RProjects/2024-Q2-R-2 [MDA2024, exercises]/D1_SD3/data_SD3.csv", stringsAsFactors=TRUE)
```

# 2. Data Analysis
To add R code to the notebook, we use a code **chunk**.

```{r}
X <- iris
```

It is possible to have an overview of the data by using the *summary* function.

```{r}
summary(X)
```
In R there are three main types of data structure:

- **Matrix**. A mathematical object whose elements all share one type. In our example, **Y** (created below) is a matrix.
- **Data Frame**. It is designed to organize and analyze data, and it generalizes the matrix because its columns may have different types. In our example, **X** is a data frame.
- **List**. An object that can contain data frames, matrices, or other lists.

```{r}
Y <- as.matrix(X[ ,1:4])
```
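
As a quick check of the three structures described above, the following sketch inspects `X` and `Y` and builds a small list. The name `my_list` and its element names are illustrative assumptions and are not part of the original notebook.

```{r}
# Self-contained: recreate the objects used in this notebook
X <- iris                    # data frame: columns may have different types
Y <- as.matrix(X[, 1:4])     # matrix: all elements share one type (numeric here)

is.data.frame(X)             # TRUE
is.matrix(Y)                 # TRUE
dim(Y)                       # number of rows and columns

# A list can hold data frames, matrices, and other lists (hypothetical names)
my_list <- list(frame = X, mat = Y, nested = list(a = 1, b = "text"))
names(my_list)
class(my_list$nested)        # "list"
```

Elements of a list are accessed by name with `$` or by position with `[[ ]]`, for example `my_list$mat` or `my_list[[2]]`.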

To handle data you can use the following code:

```{r}
X[10, 2]       # selection of one element in the Data Frame (or matrix)
X[5:20, 1:3]   # selection of an interval
X[5:20, ]      # an empty index selects all rows or all columns (here: all columns)
X$Sepal.Length # the symbol $ is used to select a column in the data frame
```
## 2.1. Plots in R

```{r}
# TRUE rather than T: T is an ordinary variable and can be overwritten
boxplot(X$Sepal.Length, main = "Box Plot of the Sepal Length of the IRIS Flowers", col = "blue", horizontal = TRUE)
```
The elements of the box-plot are reported below:

- **Q1**. It is the *First Quartile*. It leaves 25% of units on the left and 75% of units on the right. *Left side of the box*
- **Me**. It is the *Median=Q2*. It leaves 50% of units on left and right. *Bold Line in the middle of the box*
- **Q3**. It is the *Third Quartile*. It leaves 75% of the units on the left and 25% of the units on the right. *Right side of the box*
- In case of **no outliers**, the whiskers are defined as:
  - **Xmin** is the left whisker
  - **Xmax** is the right whisker
- In case of **outliers**, the whiskers stop at the most extreme observations inside the following limits:
  - **Linf** = Q1 - 1.5 (Q3-Q1): Lower Limit
  - **Lsup** = Q3 + 1.5 (Q3-Q1): Upper Limit
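
The quartiles and the two limits can also be computed by hand, as in the sketch below. It assumes the usual convention that the upper limit starts from Q3, and it uses `quantile()` with R's default quartile definition; `boxplot()` itself works with hinges (`fivenum()`), which can differ slightly on some samples. The variable names are illustrative.

```{r}
x <- iris$Sepal.Width

q   <- quantile(x, probs = c(0.25, 0.50, 0.75), names = FALSE)  # Q1, Me, Q3
iqr <- q[3] - q[1]                                              # Q3 - Q1

l_inf <- q[1] - 1.5 * iqr   # lower limit
l_sup <- q[3] + 1.5 * iqr   # upper limit

# Observations outside the limits are drawn as outliers
outliers <- x[x < l_inf | x > l_sup]

# The whiskers end at the most extreme observations inside the limits
whiskers <- range(x[x >= l_inf & x <= l_sup])

outliers
whiskers
boxplot.stats(x)$out   # outliers as identified by the boxplot machinery
```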
  
```{r}
# just a boxplot
boxplot(X[ ,1:4])
# boxplot with a title and colors
boxplot(X[ ,1:4], main = "Box Plot of all quantitative variables of IRIS data", col = terrain.colors(4))
```
  
```{r}
boxplot(X$Sepal.Width ~ X$Species, main = "Box Plot of Sepal Width considering the 3 types of flowers", xlab = "Type of IRIS Flowers", ylab = "Sepal Width", col = terrain.colors(4))
```

The *tilde symbol* is obtained by:

- these do not work [*]:
  - MAC: option + 5
  - WIN: ALT + 125/6
- these work [*]:
  - MAC: shift + the key to the left of 1
  - WIN: shift + the key to the left of 1

## [ 2.1. BONUS ]

The bold line is the **Median**, that is, the value of the ordered distribution that leaves the same number of units above and below (or on the left and right).

- **Q1** is the first quartile. Q1 leaves 25% of units on the left and 75% on the right.
- **Q3** is the third quartile. Q3 leaves 75% of units on the left and 25% on the right.
- Whiskers, **without outliers**, are the min and max of the distribution
- Whiskers, **with outliers**, are limited by Linf = Q1 - 1.5 (Q3-Q1) and Lsup = Q3 + 1.5 (Q3-Q1)

```{r}
boxplot(X$Sepal.Length, main = "Box-Plot of the Sepal Length", col = "green", horizontal = FALSE)
boxplot(X[ ,1:4], main = "Box-Plot with all the Variables", col = "blue", horizontal = FALSE)
boxplot(X$Sepal.Width ~ X$Species, main = "Box-Plot of Sepal Width by type of IRIS Flower")
```

## 2.2. Bar Plot

A bar plot can be used for **Qualitative Data** and for **Categorized Quantitative Data**. The first step in creating a bar plot is to build a *frequency table*.

```{r}
freq <- table(X$Species)  # named freq, not T, so the TRUE shorthand T is not masked
freq
```

```{r}
barplot(freq, main = "Bar Plot of Type of flowers", xlab = "Type of flowers", ylab = "Absolute Frequency", col = terrain.colors(4))
```

## 2.3. Pie Chart

It is based on the frequency table.

```{r}
pie(freq, main = "Pie Chart", col = terrain.colors(4))
```

## 2.4. Histogram Chart

A histogram is a plot used only for **Quantitative Data**; it is based on a frequency table grouped into classes. The *R* function is called *hist*, and its input is the vector of values of a quantitative variable.

```{r}
hist(X$Sepal.Length)
```

```{r}
hist(X$Sepal.Width, main = "Histogram of Sepal Width", xlab = "Classes", ylab = "Absolute Frequency", col = "lightgreen", border = "blue")
```

## [ 2.4. BONUS ]

The histogram can be used only for **quantitative variables**.

```{r}
hist(X$Sepal.Width, main = "Histogram of the Sepal Width", xlab = "Classes",
     ylab = "Absolute Frequency", col = "green", border = "red", breaks = 10)

```

With equally sized classes, the *Y* axis can report the absolute frequency or the relative frequency. With classes of different sizes, the *Y* axis has to report the *frequency density*: $d_i = n_i/h_i$, where $n_i$ is the absolute frequency and $h_i$ is the width of class $i$.
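
The density formula can be checked numerically. The sketch below assumes an arbitrary set of unequal class limits for *Sepal Width* (the `breaks` vector is an illustration, not taken from the notebook) and computes $d_i = n_i/h_i$ for each class.

```{r}
x <- iris$Sepal.Width

# Classes of different sizes (these break points are an arbitrary assumption)
breaks <- c(2.0, 2.5, 3.0, 3.2, 3.5, 4.4)

h   <- hist(x, breaks = breaks, plot = FALSE)  # compute the classes without drawing
n_i <- h$counts                                # absolute frequency of each class
h_i <- diff(h$breaks)                          # size (width) of each class
d_i <- n_i / h_i                               # frequency density

data.frame(lower = head(breaks, -1), upper = tail(breaks, -1), n_i, h_i, d_i)

# With unequal classes hist() plots a density; it equals d_i divided by the
# total number of units, so that the bar areas sum to 1
all.equal(h$density, d_i / length(x))
```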

# 3. Correlation Analysis

## 3.1. Correlation Plot

```{r}
plot(X$Sepal.Length, X$Sepal.Width, main = "Correlation Plot", xlab = "Sepal Length", ylab = "Sepal Width")
```

## [ 3.1. BONUS ]

```{r}
plot(X$Petal.Length, X$Petal.Width, main = "Correlation Plot",
     xlab = "Petal Length", ylab = "Petal Width", col = "blue",
     pch = 1)
```

Plots with IRIS

```{r}
plot(X$Sepal.Length, X$Sepal.Width, main = "(1-2) Plot with IRIS", 
     xlab = "Sepal Length", ylab = "Sepal Width", col = "blue")
plot(X$Petal.Length, X$Petal.Width, main = "(2-2) Plot with IRIS", 
     xlab = "Petal Length", ylab = "Petal Width", col = "red")
```

## 3.2. Pair Plot

```{r}
pairs(X[ ,1:4])
```

The plots below the main diagonal mirror the plots above it, with the two axes swapped. The reason is that both the scatter plot and the correlation index are symmetric in the two variables.

```{r}
r <- cor(X[ ,1:2])
r <- round(r, 3)
r
cor(X$Sepal.Length, X$Sepal.Width)

```

The range of the correlation index is: *-1 <= r <= 1*

The interpretation of the *Correlation Index*, called **r**, is the following:

- |r| = 0 *No Correlation*
- 0.00 < |r| <= 0.25 *Low Correlation*
- 0.25 < |r| <= 0.50 *Medium-Low Correlation*
- 0.50 < |r| <= 0.75 *Medium-High Correlation*
- 0.75 < |r| < 1.00 *High Correlation*
- |r| = 1 *Perfect Correlation*

The correlation between *Sepal Length* and *Sepal Width* is -0.118 and it is a low negative correlation.
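
The interpretation scale can be written as a small helper, sketched below. The function name `classify_correlation` is hypothetical and the thresholds are simply the ones listed in this section.

```{r}
# Hypothetical helper: turns one correlation value into a verbal label
classify_correlation <- function(r) {
  stopifnot(is.numeric(r), length(r) == 1, abs(r) <= 1)
  a <- abs(r)
  strength <- if (a == 0) {
    "no"
  } else if (a <= 0.25) {
    "low"
  } else if (a <= 0.50) {
    "medium-low"
  } else if (a <= 0.75) {
    "medium-high"
  } else if (a < 1) {
    "high"
  } else {
    "perfect"
  }
  direction <- if (r > 0) "positive" else if (r < 0) "negative" else ""
  parts <- c(strength, direction, "correlation")
  paste(parts[nzchar(parts)], collapse = " ")   # drop the empty direction when r is 0
}

classify_correlation(cor(iris$Sepal.Length, iris$Sepal.Width))
# "low negative correlation"
```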

# 4. Plots with GGPLOT Package

```{r}
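# install ggplot2 only when it is missing; reinstalling an already loaded package triggers "Updating loaded packages"
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")
library(ggplot2)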
```

A ggplot2 plot is built from 3 main components:

- The first is the data: *ggplot(data = X)*. On its own it creates an empty frame.
- The second is the geometry (type of plot): *geom_*. It adds a layer with the type of plot we want to show.
- The third is the aesthetic mapping, which selects the variables and their visual properties: *mapping = aes()*

```{r}
# GGPLOT (example no 1)
ggplot(data = X[ ,1:2])
# GGPLOT (example no 2.1): works, but using $ inside aes() bypasses the data argument and is discouraged
ggplot(data = X[ ,1:2]) +
  geom_point(mapping = aes(X$Sepal.Length, X$Sepal.Width))
# GGPLOT (example no 2.2)
ggplot(data = X[ ,1:2]) +
  geom_point(mapping = aes(Sepal.Length, Sepal.Width))
# GGPLOT (example no 2.3)
ggplot(data = X) +
  geom_point(mapping = aes(Sepal.Length, Sepal.Width))
# GGPLOT (example no 3)
ggplot(data = X) +
  geom_point(mapping = aes(Sepal.Length, Sepal.Width)) +
  ggtitle("Scatter Plot") + xlab("Sepal Length") + ylab("Sepal Width")
```

## 4.1. Correlation Plot (Scatter Plot) with colors

```{r}
ggplot(data = X) +
  geom_point(mapping = aes(Sepal.Length, Sepal.Width, color = Species))
```

## [ 4.1. BONUS ]

**STANDARD TEMPLATE IS:** `ggplot(data = <DATA>) + <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))`

```{r}
# First Example
ggplot(data = X) + 
  geom_point(mapping = aes(Petal.Length, Petal.Width, color = Species)) +
  ggtitle("Petal Length and Width") +
  xlab("Petal Length") + ylab("Petal Width")
# Second Example
ggplot(data = X, mapping = aes(Petal.Width, Petal.Length)) + 
  geom_point(mapping = aes(color = Species)) +
  ggtitle("Petal Width and Length") +
  xlab("Petal Width") + ylab("Petal Length")
```

## 4.2. Box Plot

```{r}
ggplot(data = X) +
  geom_boxplot(mapping = aes(Sepal.Width), color = "blue", outlier.colour = "red", outlier.shape = 8, outlier.size = 3) +
  ggtitle("Box Plot for Sepal Width")
```

Box Plot taking into account the 3 types of flowers

```{r}
ggplot(data = X) +
  geom_boxplot(mapping = aes(Species, Sepal.Width), outlier.color = "red", outlier.shape = 8)
```

A ggplot can be stored in an object and extended later

```{r}
p <- ggplot(data = X) +
  geom_boxplot(mapping = aes(Species, Sepal.Width, fill = Species))
p
p + theme(legend.position = "bottom")
```

## 4.3. Bar Plot

```{r}
ggplot(data = X) + 
  geom_bar(mapping = aes(Species)) + 
  ggtitle("Bar Plot with GGPLOT") + 
  ylab("Absolute Frequency")
```

# 5. Use the mpg data (fuel economy data from 1999 to 2008 for 38 popular models of cars)

A data frame with 234 rows and 11 variables:

- *manufacturer* brand name
- *model* model name
- *displ* engine displacement, in litres (the size of the engine, which is related to its power but is not a power measure)
- *year* year of manufacture
- *cyl* number of cylinders
- *trans* type of transmission
- *drv* the type of drive train, where f = front-wheel drive, r = rear wheel drive, 4 = 4wd
- *cty* city miles per gallon (the analogue of km per liter, in town)
- *hwy* highway miles per gallon (the analogue of km per liter, on the highway)
- *fl* fuel type
- *class* "type" of car

```{r}
Y <- mpg
```

```{r}
summary(Y)
head(Y)
```

## 5.1. Bar Chart

```{r}
table(Y$cyl)
```

```{r}
# first example: factor(cyl) as a color ( vertical bar chart )
ggplot(data = Y) +
  geom_bar(mapping = aes(cyl, fill = factor(cyl)))
# second example: class as a color ( vertical bar chart )
ggplot(data = Y) +
  geom_bar(mapping = aes(cyl, fill = class))
# third example: class instead of cyl on the x axis ( vertical bar chart )
ggplot(data = Y) +
  geom_bar(mapping = aes(class, fill = class))
# fourth example: class instead of cyl ( horizontal bar chart )
ggplot(data = Y) +
  geom_bar(mapping = aes(class, fill = class)) + coord_flip()
# fifth example: class instead of cyl ( horizontal bar chart & legend at the bottom )
ggplot(data = Y) +
  geom_bar(mapping = aes(class, fill = class)) + coord_flip() + theme(legend.position = "bottom")
```

## 5.2. Histogram

```{r}
# example no 1.1
ggplot(data = Y) +
  geom_histogram(mapping = aes(cty))
# example no 1.2
ggplot(data = Y) +
  geom_histogram(mapping = aes(cty, colour = class))
# example no 1.3
ggplot(data = Y) +
  geom_histogram(mapping = aes(cty, fill = class))
# example no 1.4: a constant colour goes outside aes(); inside aes() it would be treated as a data value
ggplot(data = Y) +
  geom_histogram(mapping = aes(cty), fill = "red")
# example no 1.5
ggplot(data = Y) +
  geom_histogram(mapping = aes(cty, fill = factor(cty)))
# example 2
ggplot(data = Y) +
  geom_histogram(mapping = aes(hwy))
```
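
The histograms above rely on the ggplot2 default of 30 bins, which is why ggplot2 prints a message suggesting `binwidth`. A minimal sketch with an explicit bin width follows; the value 2 and the object name `p_hist` are arbitrary assumptions chosen for illustration.

```{r}
library(ggplot2)

# binwidth = 2 (miles per gallon) is an illustrative choice, not a recommendation
p_hist <- ggplot(data = mpg) +
  geom_histogram(mapping = aes(cty), binwidth = 2,
                 fill = "lightgreen", colour = "blue") +
  ggtitle("Histogram of cty") +
  xlab("City miles per gallon") + ylab("Absolute Frequency")
p_hist
```

Setting `bins` instead of `binwidth` fixes the number of classes rather than their size.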

## 5.3. Facet Wrap with 1 grouping variable [*]

With the *facet_wrap* layer we can create sub-plots based on a categorical variable.

```{r}
ggplot(data = Y) +
  geom_point(mapping = aes(displ, hwy)) +
  facet_wrap(~drv)
```

## 5.4. Facet Wrap with 2 grouping variables [*]

```{r}
ggplot(data = Y) +
  geom_point(mapping = aes(displ, hwy)) +
  facet_wrap(drv ~ class)
```

## 5.5. Smoothing Plot [*]

```{r}
ggplot(data = Y) +
  geom_smooth(mapping = aes(displ, hwy))
```

### 5.5.1. Smoothing Plot with "different type of line" [*]

```{r}
# First Example
ggplot(data = Y) +
  geom_smooth(mapping = aes(displ, hwy, linetype = class))
# Second Example
ggplot(data = Y) +
  geom_smooth(mapping = aes(displ, hwy, linetype = drv))
```

### 5.5.2. Smoothing Plot with "Facet Wrap" [*]

```{r}
ggplot(data = Y) +
  geom_smooth(mapping = aes(displ, hwy)) +
  facet_wrap(~ drv)
```

### 5.5.3. Smoothing Plot with "color" [*]

```{r}
ggplot(data = Y) +
  geom_smooth(mapping = aes(displ, hwy, color = drv))
```

### 5.5.4. Smoothing Plot with "group" [*]

```{r}
ggplot(data = Y) +
  geom_smooth(mapping = aes(displ, hwy, group = drv))
```

### 5.5.5. Smoothing Plot combining different layers [*]

```{r}
ggplot(data = Y) +
  geom_point(mapping = aes(displ, hwy, color = drv)) +
  geom_smooth(mapping = aes(displ, hwy, color = drv))
```

# 6. Regression Model [*]

In a regression model we need to define the *dependent* and *independent* variables. In our case the model is defined as follows:

- *Y(Dependent/Outcome Variable)* = **hwy**. The variable gives the fuel economy on the highway, in miles per gallon (the analogue of km per liter).
- *X(Independent/Input Variable)* = **displ**. The variable gives the engine displacement in litres (engine size, which is related to horsepower but is not the same thing).

First, we create a scatter plot (correlation plot).

## 6.1. Regression Model: Plot [*]

```{r}
ggplot(data = Y) +
  geom_point(mapping = aes(displ, hwy)) +
  geom_smooth(method = lm, mapping = aes(displ, hwy))
```

## 6.2. Regression Model: A Different Fit for each group [*]

```{r}
ggplot(data = Y) +
  geom_point(mapping = aes(displ, hwy, color  = drv)) +
  geom_smooth(method = lm, mapping = aes(displ, hwy, color = drv))

```

## 6.3. Regression Model: Parameters Estimation [*]

```{r}
res.reg <- lm(hwy ~ displ, data = Y)
summary(res.reg)
```
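
Once the model is estimated, its parameters can be extracted and used for prediction. The sketch below refits the same model directly on `mpg` so that it runs on its own; the data frame `new_cars` and its displacement values are hypothetical inputs chosen for illustration.

```{r}
library(ggplot2)   # provides the mpg data

res.reg <- lm(hwy ~ displ, data = mpg)

coef(res.reg)                   # intercept and slope
summary(res.reg)$r.squared      # share of the variance of hwy explained by displ

# Predicted highway mileage for hypothetical engines of 2.0 and 4.0 litres
new_cars <- data.frame(displ = c(2.0, 4.0))
predict(res.reg, newdata = new_cars)
```

The slope of about -3.53 means that each additional litre of displacement is associated with roughly 3.5 fewer highway miles per gallon on average.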



